chem16S: community-level chemical metrics for exploring genomic adaptation to environments

Abstract Summary The chem16S package combines taxonomic classifications of 16S rRNA gene sequences with amino acid compositions of prokaryotic reference proteomes to generate community reference proteomes. Taxonomic classifications from the RDP Classifier or data objects created by the phyloseq R package are supported. Users can calculate and visualize a variety of chemical metrics in order to explore the effects of redox, salinity, and other physicochemical variables on the genomic adaptation of protein sequences at the community level. Availability and implementation Development of chem16S is hosted at https://github.com/jedick/chem16S. Version 1.0.0 is freely available from the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/package=chem16S.


Introduction
Chemical analysis of protein sequences can reveal new aspects of genomic adaptation.Carbon oxidation state (Z C ) represents the degree of oxidation of carbon atoms that results from bonds with other atoms (i.e.H, N, O, and S in the primary sequences of proteins).Protein Z C tends to be lower in the genomes of methanogens that inhabit anoxic environments compared to those that are occasionally found in oxygenated environments (Dick et al. 2023).This suggests that geochemical relative stability models are applicable to genomic variation, but identifying the role of the environment is complicated by the cosmopolitan nature of many organisms.
Microbial communities represent a localized collection of genomes.Recently we described the combination of taxonomic abundances from the Ribosomal Database Project (RDP) Classifier (Wang et al. 2007) with reference proteomes derived from the Reference Sequence (RefSeq) database of the National Center for Biotechnology Information (NCBI) (O'Leary et al. 2016) to generate community reference proteomes (Dick and Tan 2023).This allows new tests of hypotheses about genomic adaptation.For instance, the thermodynamic prediction that protein Z C is positively correlated with environmental oxidation-reduction potential has been confirmed for bacterial communities at a global scale (Dick and Meng 2023).
Community reference proteomes are inferred by taxonomic comparisons with genomic databases rather than directly derived from protein extracts or community DNA (i.e.metaproteomes and metagenomes).Despite this, trends of Z C for community reference proteomes are mostly consistent with those for protein sequences inferred from shotgun metagenomes (Dick and Tan 2023).Different natural abundances and extraction efficiencies for cytoplasmic and membrane proteins in metaproteomic experiments can explain in part the relatively weak correspondence of Z C between metaproteomes and community reference proteomes (Dick and Meng 2023).For these reasons, metrics calculated for community reference proteomes are indicators of genomic variation rather than protein expression.
The code that was developed in our recent studies was consolidated to create the chem16S package.An important new addition to the package are reference proteomes derived from the Genome Taxonomy Database (GTDB) (Parks et al. 2022).Using the GTDB for both taxonomic classification and reference proteomes avoids the uncertain mapping between the RDP training set and the NCBI taxonomy.Moreover, functions have been added to chem16S to process data objects generated with the phyloseq package (McMurdie and Holmes 2013), allowing for seamless calculation and visualization of chemical metrics for microbial communities by more researchers.

Implementation
chem16S is written in R (R Core Team 2023) and is structured around three core functions.read_RDP reads taxonomic classifications from an output file of the RDP Classifier, removes domain-level and certain nonprokaryotic classifications (e.g.Chloroplast), then retains the lowest-level classification for each input sequence from genus to phylum level.map_taxa maps classifications to the NCBI taxonomy by automatic matching of both taxonomic rank and name together with manual mapping for some taxa (see Dick and Tan 2023); an option is available to modify the manual mappings at runtime.Then, get_metrics combines the abundances of mapped taxa with precomputed amino acid compositions of reference proteomes for taxa in order to obtain the community reference proteome for each sample.
Users can select from a variety of chemical metrics, including the aforementioned Z C .A projection from elemental composition to chemical composition by means of thermodynamic components (aka basis species) permits the calculation of stoichiometric oxidation state (n O 2 ) and hydration state (n H2O ).For protein sequences in a given genome, the elementbased metric Z C and reaction-based metric n O 2 strongly covary, but n H2O is largely decoupled from Z C (Dick 2022); salinity is one factor that may modulate the differences of n H 2 O between free-living communities (Dick et al. 2020).Other metrics available in chem16S are elemental ratios (H/C, N/C, O/C, and S/C), basic physicochemical quantities including protein length and molecular weight of amino acid residues, and derived quantities including grand average of hydropathicity (GRAVY) and isoelectric point (pI); the latter two are modeled after the ProtParam tool in UniProt (Gasteiger et al. 2005).
RefSeq-based reference proteomes and manual mapping between RDP and NCBI taxonomies were described previously (Dick and Tan 2023).However, inconsistencies between these taxonomies preclude a completely accurate mapping.For this reason, a new set of reference proteomes was generated from the Genome Taxonomy Database (GTDB release 207) using a script that is available in the package.Taxonomic assignments of 16S rRNA gene sequences made using a separately maintained GTDB training set (Alishum 2022) formatted for the DADA2 sequence processing package (Callahan et al. 2016) in principle ensure complete mapping to the GTDB reference proteomes, and a 100% mapping rate has been observed for analyses performed in this study (see below).
The phyloseq package supports analysis and visualization of microbiome data (McMurdie and Holmes 2013).chem16S can analyze taxonomic abundances in the OTU table of a "phyloseq-class" object; this table may hold either operational taxonomic units or amplicon sequence variants (ASVs), depending on the sequence processing pipeline used.Functions are provided to generate a table of lowest-level taxonomic classifications (ps_taxacounts), calculate chemical metrics for community reference proteomes (ps_metrics), and plot either individual metrics as a function of sample data (plot_ps_metrics; facets are used to visualize trends of multiple metrics) or two chemical metrics against each other (plot_ps_metrics2) using ggplot2 graphics (Wickham 2016).The plotting functions integrate the lower-level functions, so a single function call runs the calculations and makes a plot.

Applications
Two short examples illustrate the benefits of using chem16S to represent communities in terms of chemically interpretable variables.The mouse.GTDB "phyloseq-class" object was produced in this study by following the DADA2 Pipeline Tutorial version 1.16 (https://benjjneb.github.io/dada2/tutorial.html,accessed on 7 July 2023) modified to use the GTDB training set for taxonomic classification.The source data were downloaded from https://mothur.s3.us-east-2.amazonaws.com/wiki/miseqsopdata.zip(accessed on 7 July 2023) and represent a progression of gut microbiome composition in a single mouse from early (0-9 days) to late (141-150 days) times postweaning (Schloss et al. 2012).There is 100% mapping of taxonomic classifications to the GTDB-based reference proteomes, and the chemical analysis reveals anticorrelated trends in Z C and n H 2 O between early and late samples (Fig. 1A and B).A progressive decrease of whole-body body water content is a characteristic of mammalian development (Moulton 1923), so the decreasing trend of n H2O that is apparent in the community reference proteomes suggests a physicochemical link to host physiology.
The GlobalPatterns dataset that is part of the phyloseq package is based on data from Caporaso et al. (2011), in which taxonomy was assigned using the RDP Classifier.After the dataset is loaded, running the ps_metrics function displays messages about manual mappings to the NCBI taxonomy and percentages of the most abundant unmapped taxa and returns values for three default chemical metrics (Fig. 1A).The values of Z C delineate relatively reduced (e.g.tongue and ocean) and oxidized (in particular, soil) community reference proteomes and are largely decoupled from n H2O (Fig. 1C).Consistent with previous observations for metagenomes (Dick et al. 2020), the range of n H 2 O for ocean samples is lower than that for freshwater samples.The largest variation of n H2O does not occur between sample types but rather within the group of fecal samples, some of which exhibit extremely low n H2O .Unlike other sample types, fecal samples are characterized by a tight linear covariation between H/C and O/C with a slope that is somewhat >2, which likewise suggests that H 2 O is an influential thermodynamic component in the genomic variability among fecal communities.
The package includes three vignettes-accessible online at CRAN and locally after the package installed-that cover patterns of chemical metrics in the reference proteomes of prokaryotic organisms and viruses (metrics.Rmd), integration of chem16S with phyloseq (phyloseq.Rmd), and plots of two chemical metrics for visualizing trends within and between datasets (plotting.Rmd).In the latter vignette, differences of community Z C and n H 2 O between sediment in the eastern Pacific Ocean and hot springs in the Qinghai-Tibet Plateau (Fonseca et al. 2022, Zhang et al. 2023) are used to illustrate potential effects of redox conditions and salinity on genomic adaptation at a global scale.

Conclusions
The merger of taxonomic and genomic information in community reference proteomes allows microbial communities to be represented in chemical terms.chem16S makes it easy to use taxonomic classifications produced by the RDP Classifier or OTU/ASV tables in "phyloseq-class" objects to calculate and visualize chemical metrics for community reference proteomes and thereby to gain new insight into the role of physicochemical factors in genomic adaptation to environments.

Figure 1 .
Figure 1.(A) Code example showing calculation of chemical metrics for two datasets.The most abundant unmapped taxon in the GlobalPatterns dataset (Stramenopiles) is a eukaryotic order, so no mapping to archaeal or bacterial taxa is possible.(B) Individual chemical metrics for the mouse gut dataset plotted against days postweaning.(C) Pairs of chemical metrics for the GlobalPatterns dataset. chem16S